Good day Mr. CEO & CFO. I’d like to thank you for accepting my invitation. Today, I’d like to address the questions you recently posed during our last Steering Committee meeting and provide you with a brief analysis that may pique your interest. This research was done using the Beer and Breweries data provided.
# Install all needed R libraries and packages.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggthemes)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(mapproj)
library(dplyr)
#library(stringi)
#library(stringr)
#Cookbook
#Load beers data
beers = read.csv(file.choose())
#Load breweries data
breweries = read.csv(file.choose())
#Quick look at the dataframe for both datasets.
head(beers)
## Name Beer_ID ABV IBU Brewery_id
## 1 Pub Beer 1436 0.050 NA 409
## 2 Devil's Cup 2265 0.066 NA 178
## 3 Rise of the Phoenix 2264 0.071 NA 178
## 4 Sinister 2263 0.090 NA 178
## 5 Sex and Candy 2262 0.075 NA 178
## 6 Black Exodus 2261 0.077 NA 178
## Style Ounces
## 1 American Pale Lager 12
## 2 American Pale Ale (APA) 12
## 3 American IPA 12
## 4 American Double / Imperial IPA 12
## 5 American IPA 12
## 6 Oatmeal Stout 12
head(breweries)
## Brew_ID Name City State
## 1 1 NorthGate Brewing Minneapolis MN
## 2 2 Against the Grain Brewery Louisville KY
## 3 3 Jack's Abby Craft Lagers Framingham MA
## 4 4 Mike Hess Brewing Company San Diego CA
## 5 5 Fort Point Beer Company San Francisco CA
## 6 6 COAST Brewing Company Charleston SC
#Use the breweries dataset to count the number of breweries present in each State.
#This data will be stored in the variable "breweries_Count_State."
breweries_Count_State = dplyr::count(breweries,State)
#Rename the column n to a recognizable name.
colnames(breweries_Count_State)[2] ="Count"
####Changed count to 2 to make it more legible
#Here I'm using the totals variable to display the number of breweries per State.
breweries_Count_State %>% ggplot(mapping = aes(x = State, y = Count)) +
geom_bar(stat = "identity", color = "darkgrey", fill = "darkslategray4" )+
xlab("States") + ylab("Count of Breweries")+
geom_text(aes(State, Count+1, label = Count), data = breweries_Count_State)+
ggtitle("Breweries per State")+
theme_tufte()
#Use the Inner join method to merge the beers and breweries dataset via the Brew_ID and Brewery_ID
beers_Breweries = merge(beers, breweries, by.x = "Brewery_id", by.y = "Brew_ID")
head(beers_Breweries)
## Brewery_id Name.x Beer_ID ABV IBU
## 1 1 Get Together 2692 0.045 50
## 2 1 Maggie's Leap 2691 0.049 26
## 3 1 Wall's End 2690 0.048 19
## 4 1 Pumpion 2689 0.060 38
## 5 1 Stronghold 2688 0.060 25
## 6 1 Parapet ESB 2687 0.056 47
## Style Ounces Name.y City
## 1 American IPA 16 NorthGate Brewing Minneapolis
## 2 Milk / Sweet Stout 16 NorthGate Brewing Minneapolis
## 3 English Brown Ale 16 NorthGate Brewing Minneapolis
## 4 Pumpkin Ale 16 NorthGate Brewing Minneapolis
## 5 American Porter 16 NorthGate Brewing Minneapolis
## 6 Extra Special / Strong Bitter (ESB) 16 NorthGate Brewing Minneapolis
## State
## 1 MN
## 2 MN
## 3 MN
## 4 MN
## 5 MN
## 6 MN
tail(beers_Breweries)
## Brewery_id Name.x Beer_ID ABV IBU
## 2405 556 Pilsner Ukiah 98 0.055 NA
## 2406 557 Heinnieweisse Weissebier 52 0.049 NA
## 2407 557 Snapperhead IPA 51 0.068 NA
## 2408 557 Moo Thunder Stout 50 0.049 NA
## 2409 557 Porkslap Pale Ale 49 0.043 NA
## 2410 558 Urban Wilderness Pale Ale 30 0.049 NA
## Style Ounces Name.y City
## 2405 German Pilsener 12 Ukiah Brewing Company Ukiah
## 2406 Hefeweizen 12 Butternuts Beer and Ale Garrattsville
## 2407 American IPA 12 Butternuts Beer and Ale Garrattsville
## 2408 Milk / Sweet Stout 12 Butternuts Beer and Ale Garrattsville
## 2409 American Pale Ale (APA) 12 Butternuts Beer and Ale Garrattsville
## 2410 English Pale Ale 12 Sleeping Lady Brewing Company Anchorage
## State
## 2405 CA
## 2406 NY
## 2407 NY
## 2408 NY
## 2409 NY
## 2410 AK
#Rename the columns Name.x and Name.y to be more meaningful names.
colnames(beers_Breweries)[2] = "Beer_Name"
colnames(beers_Breweries)[8] = "Brewery_Name"
#The following columns had NA values/missing data: ABV, IBU, and Styles(these were blank and Not N/A). To Tidy the data, we excluded those values from the dataset. We do not have a way to find those data points at this time.
beersNbreweries = na.omit(beers_Breweries)
#Compute the median the ABV (alcohol content) for each state.
ABV_Median_State = beersNbreweries %>% group_by(State) %>% dplyr::summarize(ABV_Median = median(ABV), count = n())
#Plot a bar chart
ABV_Median_State %>% ggplot(mapping = aes(x= State, y = ABV_Median)) +
geom_bar(stat = "identity",color = "darkgrey", fill = "darkslategray4")+
geom_text(aes(State, ABV_Median+0.002, label = ABV_Median), data = ABV_Median_State, size=2.3)+
theme_tufte()+
ggtitle("Median ABV per State") + xlab("STATES") + ylab ("ABV Mean")
#Compute the median IBU (International bitterness Unit)
IBU_Median_State = beersNbreweries %>% group_by(State) %>% dplyr::summarize(IBU_Median = median(IBU), count= n())
#Plot a bar chart
IBU_Median_State %>% ggplot(mapping = aes(x= State, y = IBU_Median )) +
geom_bar(stat = "identity",color = "darkgrey", fill = "darkslategray4")+
geom_text(aes(State, IBU_Median+1, label = IBU_Median), data = IBU_Median_State, size=4)+
theme_tufte()+
ggtitle("Median IBU per State") + xlab("STATES") + ylab ("IBU Mean")
# Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?
# Finds the Max number within the ABV variable and the row number (index) in which the maximum ABV number may be found.
state_Max_ABV = beersNbreweries %>% summarise(Max_ABV= max(ABV), row_Index = which.max(ABV))
state_Max_ABV # The maximum ABV is 0.125 and can be found in row 8.
## Max_ABV row_Index
## 1 0.125 8
#Uses the row number (8) from above to find the State that coincides with the maximum ABV.
beersNbreweries$State[8]
## [1] " KY"
#The State is Kentucky (KY).
# Find the Max number within the IBU variable and the row number (index) in which the maximum IBU number may be found.
state_Max_IBU = beersNbreweries %>% summarise(Max_IBU= max(IBU), row_Index = which.max(IBU))
state_Max_IBU # The maximum IBU is 138 and can be found in row 1134.
## Max_IBU row_Index
## 1 138 1134
#Uses the row number (1134) from above to find the State that coincides with the maximum IBU.
beersNbreweries$State[1134]
## [1] " OR"
#The State is Oregon (OR).
#Scatter Plot showing the distribution of the ABV versus the IBU variables.
g = beersNbreweries %>% ggplot(mapping = aes(x = ABV, y = IBU, position ="jitter")) +
geom_point(color = "darkslategray4")+
theme_tufte()+
xlab("ABV") + ylab("IBU")+
ggtitle("Relationship between bitterness and Alcohol content")+
geom_smooth(aes(x = ABV, y = IBU))
ggplotly(g)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# The relationship between the ABV and IBU appears to be positively linear. As the ABV increases we can see that IBU also increases. There is a cluster around the 0.050 ABV value, indicating that a great majority of beers try to stay near that number.In our data set most of the ABV data was below the 0.100 ABV value. There were only two beers that went above that number.However, the IBU values was not as high as expected given that the trend is linear. These two beers were in the States of Indiana and Kentucky.
#Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN classification to investigate this relationship. Provide statistical evidence one way or the other. You can of course assume your audience is comfortable with percentages … KNN is very easy to understand conceptually.
#Due June 26th
# (grep("SUSHI", df_Baltimore_Restaurants$name))
# length(grep("SUSHI", df_Baltimore_Restaurants$name))
# sum(grepl("SUSHI", df_Baltimore_Restaurants$name))
# which(grepl("SUSHI", df_Baltimore_Restaurants$name))
# which(grepl("0.125",beersNbreweries$State))
Comment on the summary statistics and distribution of the ABV variable.